Abstract

In this paper we present TDLeaf(λ), a variation on the TD(λ)
algorithm that enables it to be used in conjunction with
game-tree search. We present some experiments in which our
chess program "KnightCap" used TDLeaf(λ) to learn its
evaluation function while playing on the Free Internet Chess
Server (FICS, fics.onenet.net). The main success we report is
that KnightCap improved from a 1650 rating to a 2150 rating in
just 308 games and 3 days of play. As a reference, a rating of
1650 corresponds to about level B human play (on a scale from
E (1000) to A (1800)), while 2150 is human master level. We
discuss some of the reasons for this success, principal among
them being the use of on-line play rather than self-play.

1 Introduction 

Temporal Difference learning, first introduced by Samuel [5]
and later extended and formalized by Sutton [?] in his TD(λ)
algorithm, is an elegant technique for approximating the
expected long term future cost (or cost-to-go) of a stochastic
dynamical system as a function of the current state. The
mapping from states to future cost is implemented by a
parameterized function approximator such as a neural network.
The parameters are updated online after each state transition,
or possibly in batch updates after several state transitions.
The goal of the algorithm is to improve the cost estimates as
the number of observed state transitions and associated costs
increases.

Perhaps the most remarkable success of TD(λ) is Tesauro's
TD-Gammon, a neural network backgammon player that was trained
from scratch using TD(λ) and simulated self-play. TD-Gammon is
competitive with the best human backgammon players [?]. In
TD-Gammon the neural network played a dual role, both as a
predictor of the expected cost-to-go of the position and as a
means to select moves. In any position the next move was chosen
greedily by evaluating all positions reachable from the current
state, and then selecting the move leading to the position with
smallest expected cost. The parameters of the neural network
were updated according to the TD(λ) algorithm after each game.

Although the results with backgammon are quite striking, there
is lingering disappointment that despite several attempts, they
have not been repeated for other board games such as othello,
Go and the "drosophila of AI", chess.

Many authors have discussed the peculiarities of backgammon
that make it particularly suitable for Temporal Difference
learning with self-play [?]. Principal among these are speed of
play: TD-Gammon learnt from several hundred thousand games of
self-play; representation smoothness: the evaluation of a
backgammon position is a reasonably smooth function of the
position (viewed, say, as a vector of piece counts), making it
easier to find a good neural network approximation; and
stochasticity: backgammon is a random game, which forces at
least a minimal amount of exploration of the search space.

As TD-Gammon in its original form only searched one ply ahead,
we feel this list should be extended with: shallow search is
good enough against humans. There are two possible reasons for
this: either one does not gain a lot by searching deeper in
backgammon (questionable given that recent versions of
TD-Gammon search to three-ply and this significantly improves
their performance), or humans are simply incapable of searching
deeply and so TD-Gammon is only competing in a pool of shallow
searchers. Although we know of no psychological studies
investigating the depth to which humans search in backgammon,
it is plausible that the combination of high branching factor
and random move generation makes it quite difficult to search
more than one or two ply ahead. In particular, random move
generation effectively prevents selective search or "forward
pruning" because it enforces a lower bound on the branching
factor at each move.

In contrast, finding a representation for chess, othello or 
Go which allows a small neural network to order moves at 
one-ply with near human performance is a far more diffi- 
cult task. It seems that for these games, reliable tactical 
evaluation is difficult to achieve without deep lookahead. 
As deep lookahead invariably involves some kind of mini- 
max search, which in turn requires an exponential increase 
in the number of positions evaluated as the search depth 
increases, the computational cost of the evaluation func- 
tion has to be low, ruling out the use of expensive evalua- 
tion functions such as neural networks. Consequently most 
chess and othello programs use linear evaluation functions 
(the branching factor in Go makes minimax search to any 
significant depth nearly infeasible). 

In this paper we introduce TDLeaf(λ), a variation on the TD(λ)
algorithm that can be used to learn an evaluation function for
use in deep minimax search. TDLeaf(λ) is identical to TD(λ)
except that instead of operating on the positions that occur
during the game, it operates on the leaf nodes of the principal
variation of a minimax search from each position (also known as
the principal leaves).

To test the effectiveness of TDLeaf(λ), we incorporated it into
our own chess program, KnightCap. KnightCap has a particularly
rich board representation enabling relatively fast computation
of sophisticated positional features, although this is achieved
at some cost in speed (KnightCap is about 10 times slower than
Crafty, the best public-domain chess program, and 6,000 times
slower than Deep Blue). We trained KnightCap's linear
evaluation function using TDLeaf(λ) by playing it on the Free
Internet Chess Server (FICS, fics.onenet.net) and on the
Internet Chess Club (ICC, chessclub.com). Internet play was
used to avoid the premature convergence difficulties associated
with self-play (footnote 1). The main success story we report
is that starting from an evaluation function in which all
coefficients were set to zero except the values of the pieces,
KnightCap went from a 1650-rated player to a 2150-rated player
in just three days and 308 games. KnightCap is an ongoing
project with new features being added to its evaluation
function all the time. We use TDLeaf(λ) and Internet play to
tune the coefficients of these features.

1 Randomizing move choice is another way of avoiding problems
associated with self-play (this approach has been tried in Go
[?]), but the advantage of the Internet is that more
information is provided by the opponent's play.



The remainder of this paper is organized as follows. In
section 2 we describe the TD(λ) algorithm as it applies to
games. The TDLeaf(λ) algorithm is described in section 3.
Experimental results for internet-play with KnightCap are
given in section 4. Section 5 contains some discussion and
concluding remarks.

2 The TD(λ) algorithm applied to games

In this section we describe the TD(λ) algorithm as it applies
to playing board games. We discuss the algorithm from the
point of view of an agent playing the game.

Let $S$ denote the set of all possible board positions in the
game. Play proceeds in a series of moves at discrete time
steps $t = 1, 2, \ldots$. At time $t$ the agent finds itself in
some position $x_t \in S$, and has available a set of moves, or
actions, $A_{x_t}$ (the legal moves in position $x_t$). The
agent chooses an action $a \in A_{x_t}$ and makes a transition
to state $x_{t+1}$ with probability $p(x_t, x_{t+1}, a)$. Here
$x_{t+1}$ is the position of the board after the agent's move
and the opponent's response. When the game is over, the agent
receives a scalar reward, typically "1" for a win, "0" for a
draw and "-1" for a loss.

For ease of notation we will assume all games have a fixed
length of $N$ (this is not essential). Let $r(x_N)$ denote the
reward received at the end of the game. If we assume that the
agent chooses its actions according to some function $a(x)$ of
the current state $x$ (so that $a(x) \in A_x$), the expected
reward from each state $x \in S$ is given by

$$J^*(x) := E_{x_N | x}\, r(x_N), \tag{1}$$

where the expectation is with respect to the transition
probabilities $p(x_t, x_{t+1}, a(x_t))$ and possibly also with
respect to the actions $a(x_t)$ if the agent chooses its
actions stochastically.

For very large state spaces $S$ it is not possible to store the
value of $J^*(x)$ for every $x \in S$, so instead we might try
to approximate $J^*$ using a parameterized function class
$J : S \times \mathbb{R}^k \to \mathbb{R}$, for example linear
functions, splines, neural networks, etc. $J(\cdot, w)$ is
assumed to be a differentiable function of its parameters
$w = (w_1, \ldots, w_k)$. The aim is to find a parameter vector
$w \in \mathbb{R}^k$ that minimizes some measure of error
between the approximation $J(\cdot, w)$ and $J^*(\cdot)$. The
TD(λ) algorithm, which we describe now, is designed to do
exactly that.

Suppose $x_1, \ldots, x_{N-1}, x_N$ is a sequence of states in
one game. For a given parameter vector $w$, define the temporal
difference associated with the transition $x_t \to x_{t+1}$ by

$$d_t := J(x_{t+1}, w) - J(x_t, w). \tag{2}$$

Note that $d_t$ measures the difference between the reward
predicted by $J(\cdot, w)$ at time $t + 1$, and the reward
predicted by $J(\cdot, w)$ at time $t$. The true evaluation
function $J^*$ has the property

$$E_{x_{t+1} | x_t} \left[ J^*(x_{t+1}) - J^*(x_t) \right] = 0,$$

so if $J(\cdot, w)$ is a good approximation to $J^*$,
$E_{x_{t+1} | x_t}\, d_t$ should be close to zero. For ease of
notation we will assume that $J(x_N, w) = r(x_N)$ always, so
that the final temporal difference satisfies

$$d_{N-1} = J(x_N, w) - J(x_{N-1}, w) = r(x_N) - J(x_{N-1}, w).$$

That is, $d_{N-1}$ is the difference between the true outcome
of the game and the prediction at the penultimate move.
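
As a minimal sketch of definition (2) together with the terminal
convention $J(x_N, w) = r(x_N)$, the temporal differences for one
game could be computed as below. The function name
`temporal_differences`, the callable `evaluate` (standing in for
$J(\cdot, w)$) and the toy predictions are illustrative
assumptions, not part of KnightCap.

```python
from typing import Callable, List, Sequence


def temporal_differences(
    states: Sequence[str],
    reward: float,
    evaluate: Callable[[str], float],
) -> List[float]:
    """Return d_1, ..., d_{N-1} for one game.

    `states` holds x_1, ..., x_N. The last state is scored by the
    true outcome `reward` rather than by the evaluator, matching
    the convention J(x_N, w) = r(x_N).
    """
    values = [evaluate(x) for x in states[:-1]] + [reward]
    return [values[t + 1] - values[t] for t in range(len(values) - 1)]


# Hypothetical predictions for a four-position game that ends in a win.
predictions = {"x1": 0.1, "x2": 0.3, "x3": 0.2}
d = temporal_differences(["x1", "x2", "x3", "x4"], 1.0, predictions.__getitem__)
print(d)
```

The differences telescope: their sum is $r(x_N) - J(x_1, w)$,
which is $1.0 - 0.1 = 0.9$ in this toy game.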

At the end of the game, the TD(λ) algorithm updates the
parameter vector $w$ according to the formula

$$w := w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t, w) \left[ \sum_{j=t}^{N-1} \lambda^{j-t} d_j \right], \tag{3}$$

where $\nabla J(\cdot, w)$ is the vector of partial derivatives
of $J$ with respect to its parameters. The positive parameter
$\alpha$ controls the learning rate and would typically be
"annealed" towards zero during the course of a long series of
games. The parameter $\lambda \in [0, 1]$ controls the extent to
which temporal differences propagate backwards in time. To see
this, compare equation (3) for $\lambda = 0$:

$$\begin{aligned}
w &:= w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t, w)\, d_t \\
  &= w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t, w) \left[ J(x_{t+1}, w) - J(x_t, w) \right]
\end{aligned} \tag{4}$$

and $\lambda = 1$:

$$w := w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t, w) \left[ r(x_N) - J(x_t, w) \right]. \tag{5}$$



Consider each term contributing to the sums in equations (4)
and (5). For λ = 0 the parameter vector is being adjusted in
such a way as to move $J(x_t, w)$, the predicted reward at time
$t$, closer to $J(x_{t+1}, w)$, the predicted reward at time
$t + 1$. In contrast, TD(1) adjusts the parameter vector in
such a way as to move the predicted reward at time step $t$
closer to the final reward at time step $N$. Values of λ
between zero and one interpolate between these two behaviors.
Note that (5) is equivalent to gradient descent on the error
function $E(w) := \sum_{t=1}^{N-1} \left[ r(x_N) - J(x_t, w) \right]^2$.



Successive parameter updates according to the TD(λ) algorithm
should, over time, lead to improved predictions of the expected
reward $J(\cdot, w)$. Provided the actions $a(x_t)$ are
independent of the parameter vector $w$, it can be shown that
for linear $J(\cdot, w)$, the TD(λ) algorithm converges to a
near-optimal parameter vector [?]. Unfortunately, there is no
such guarantee if $J(\cdot, w)$ is non-linear [?], or if
$a(x_t)$ depends on $w$ [?].
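
The update of this section can be sketched for the linear case,
where $J(x, w) = w \cdot \phi(x)$ and hence
$\nabla J(x, w) = \phi(x)$. The code below is an illustration
under that assumption only: the feature vectors, the function
name `td_lambda_update` and the toy numbers are hypothetical. It
accumulates the inner sum $\sum_{j \ge t} \lambda^{j-t} d_j$ by a
backward recursion. For λ = 1 that sum telescopes to
$r(x_N) - J(x_t, w)$, and for λ = 0 it is just $d_t$.

```python
from typing import List, Sequence


def td_lambda_update(
    w: Sequence[float],
    features: Sequence[Sequence[float]],
    reward: float,
    alpha: float,
    lam: float,
) -> List[float]:
    """One end-of-game TD(lambda) update for a linear evaluator.

    `features[t]` is phi(x_{t+1}) for the non-terminal states
    x_1, ..., x_{N-1}; the terminal value is fixed to `reward`.
    """
    values = [sum(wi * fi for wi, fi in zip(w, phi)) for phi in features]
    values.append(reward)  # J(x_N, w) := r(x_N)
    d = [values[t + 1] - values[t] for t in range(len(features))]

    new_w = list(w)
    discounted = 0.0  # running value of sum_{j >= t} lam^(j-t) * d_j
    for t in range(len(d) - 1, -1, -1):
        discounted = d[t] + lam * discounted
        for i, fi in enumerate(features[t]):
            new_w[i] += alpha * fi * discounted
    return new_w


# Toy game with two non-terminal states and a win (reward 1.0).
phi = [[1.0, 0.0], [0.0, 1.0]]
w0 = [0.2, 0.5]
w_td0 = td_lambda_update(w0, phi, 1.0, alpha=0.1, lam=0.0)
w_td1 = td_lambda_update(w0, phi, 1.0, alpha=0.1, lam=1.0)
print(w_td0, w_td1)
```

With λ = 0 the first weight moves towards the next prediction
(0.5), while with λ = 1 it moves towards the final outcome
(1.0), so the second call changes the first weight by more.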

3 Minimax Search and TD(λ)

For argument's sake, assume any action $a$ taken in state $x$
leads to a predetermined state which we will denote by $x'_a$.
Once an approximation $J(\cdot, w)$ to $J^*$ has been found, we
can use it to choose actions in state $x$ by picking the action
$a \in A_x$ whose successor state $x'_a$ minimizes the
opponent's expected reward (footnote 2):

$$a^*(x) := \operatorname{argmin}_{a \in A_x} J(x'_a, w). \tag{6}$$



This was the strategy used in TD-Gammon. Unfortunately, 
for games like othello and chess it is very difficult to ac- 
curately evaluate a position by looking only one move or 
ply ahead. Most programs for these games employ some 
form of minimax search. In minimax search, one builds 
a tree from position x by examining all possible moves 
for the computer in that position, then all possible moves 
for the opponent, and then all possible moves for the com- 
puter and so on to some predetermined depth d. The leaf 
nodes of the tree are then evaluated using a heuristic eval- 
uation function (such as J(-, w)), and the resulting scores 
are propagated back up the tree by choosing at each stage 
the move which leads to the best position for the player on 
the move. See figure 1 for an example game tree and its 
minimax evaluation. With reference to the figure, note that 
the evaluation assigned to the root node is the evaluation 
of the leaf node of the principal variation; the sequence of 
moves taken from the root to the leaf if each side chooses 
the best available move. 

In practice many engineering tricks are used to improve the
performance of the minimax algorithm, α-β search being the
most famous.
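
The following is a minimal sketch of plain fixed-depth minimax
(no α-β pruning) that returns both the backed-up value and the
leaf of the principal variation. The tree shape follows the
3-ply layout of figure 1, but the leaf scores, the dictionary
names `TREE` and `SCORES`, and the function signature are
hypothetical choices made for illustration.

```python
from typing import Callable, List, Tuple


def minimax(
    node: str,
    depth: int,
    maximizing: bool,
    children: Callable[[str], List[str]],
    evaluate: Callable[[str], float],
) -> Tuple[float, str]:
    """Return (backed-up value, principal-variation leaf) for `node`."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node), node
    results = [
        minimax(k, depth - 1, not maximizing, children, evaluate) for k in kids
    ]
    pick = max if maximizing else min
    # On ties, max/min keep the first candidate: an arbitrary choice.
    return pick(results, key=lambda r: r[0])


# Hypothetical full-breadth 3-ply tree: A is ours, B and C are the opponent's.
TREE = {
    "A": ["B", "C"],
    "B": ["D", "E"], "C": ["F", "G"],
    "D": ["H", "I"], "E": ["J", "K"], "F": ["L", "M"], "G": ["N", "O"],
}
SCORES = {"H": 3, "I": -9, "J": -5, "K": -6, "L": 4, "M": 2, "N": -9, "O": 5}

value, leaf = minimax("A", 3, True, lambda n: TREE.get(n, []), SCORES.__getitem__)
print(value, leaf)
```

With these made-up scores the root value is the score of leaf L,
reached along A, C, F, L, so the root's evaluation is exactly
the evaluation of one leaf node.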

Let $J_d(x, w)$ denote the evaluation obtained for state $x$ by
applying $J(\cdot, w)$ to the leaf nodes of a depth $d$ minimax
search from $x$. Our aim is to find a parameter vector $w$ such
that $J_d(\cdot, w)$ is a good approximation to the expected
reward $J^*$. One way to achieve this is to apply the TD(λ)
algorithm to $J_d(x, w)$. That is, for each sequence of
positions $x_1, \ldots, x_N$ in a game we define the temporal
differences

$$d_t := J_d(x_{t+1}, w) - J_d(x_t, w) \tag{7}$$

as per equation (2), and then the TD(λ) algorithm (3) for
updating the parameter vector $w$ becomes

$$w := w + \alpha \sum_{t=1}^{N-1} \nabla J_d(x_t, w) \left[ \sum_{j=t}^{N-1} \lambda^{j-t} d_j \right]. \tag{8}$$

2 If successor states are only determined stochastically by the
choice of $a$, we would choose the action minimizing the
expected reward over the choice of successor states.

Figure 1: Full breadth, 3-ply search tree illustrating the
minimax rule for propagating values. Each of the leaf nodes
(H-O) is given a score by the evaluation function,
$J(\cdot, w)$. These scores are then propagated back up the
tree by assigning to each opponent's internal node the minimum
of its children's values, and to each of our internal nodes the
maximum of its children's values. The principal variation is
then the sequence of best moves for either side starting from
the root node, and this is illustrated by a dashed line in the
figure. Note that the score at the root node A is the
evaluation of the leaf node (L) of the principal variation. As
there are no ties between any siblings, the derivative of A's
score with respect to the parameters $w$ is just
$\nabla J(L, w)$.



One problem with equation (8) is that for $d > 1$, $J_d(x, w)$
is not necessarily a differentiable function of $w$ for all
values of $w$, even if $J(\cdot, w)$ is everywhere
differentiable. This is because for some values of $w$ there
will be "ties" in the minimax search, i.e. there will be more
than one best move available in some of the positions along the
principal variation, which means that the principal variation
will not be unique (see figure 2). Thus, the evaluation
assigned to the root node, $J_d(x, w)$, will be the evaluation
of any one of a number of leaf nodes.

Fortunately, under some mild technical assumptions on the
behavior of $J(x, w)$, it can be shown that for each state $x$,
the set of $w \in \mathbb{R}^k$ for which $J_d(x, w)$ is not
differentiable has Lebesgue measure zero. Thus for all states
$x$ and for "almost all" $w \in \mathbb{R}^k$, $J_d(x, w)$ is a
differentiable function of $w$. Note that $J_d(x, w)$ is also a
continuous function of $w$ whenever $J(x, w)$ is a continuous
function of $w$. This implies that even for the "bad" pairs
$(x, w)$, $\nabla J_d(x, w)$ is only undefined because it is
multi-valued. Thus we can still arbitrarily choose a particular
value for $\nabla J_d(x, w)$ if $w$ happens to land on one of
the bad points.

Figure 2: A search tree with a non-unique principal variation
(PV). In this case the derivative of the root node A with
respect to the parameters of the leaf-node evaluation function
is multi-valued, either $\nabla J(H, w)$ or $\nabla J(L, w)$.
Except for transpositions (in which case H and L are identical
and the derivative is single-valued anyway), such "collisions"
are likely to be extremely rare, so in TDLeaf(λ) we ignore them
by choosing a leaf node arbitrarily from the available
candidates.

Based on these observations we modified the TD(λ) algorithm to
take account of minimax search in an almost trivial way:
instead of working with the root positions $x_1, \ldots, x_N$,
the TD(λ) algorithm is applied to the leaf positions found by
minimax search from the root positions. We call this algorithm
TDLeaf(λ). Full details are given in figure 3.



4 TDLeaf(λ) and Chess

In this section we describe the outcome of several experiments
in which the TDLeaf(λ) algorithm was used to train the weights
of a linear evaluation function in our chess program
"KnightCap". KnightCap is a reasonably sophisticated computer
chess program for Unix systems. It has all the standard
algorithmic features that modern chess programs tend to have as
well as a number of features that are much less common. For
more details on KnightCap, including the source code, see
wwwsyseng.anu.edu.au/lsg.



Let $J(\cdot, w)$ be a class of evaluation functions
parameterized by $w \in \mathbb{R}^k$. Let $x_1, \ldots, x_N$ be
$N$ positions that occurred during the course of a game, with
$r(x_N)$ the outcome of the game. For notational convenience
set $J(x_N, w) := r(x_N)$.

1. For each state $x_i$, compute $J_d(x_i, w)$ by performing
   minimax search to depth $d$ from $x_i$ and using
   $J(\cdot, w)$ to score the leaf nodes. Note that $d$ may vary
   from position to position.

2. Let $x_i^l$ denote the leaf node of the principal variation
   starting at $x_i$. If there is more than one principal
   variation, choose a leaf node from the available candidates
   at random. Note that

   $$J_d(x_i, w) = J(x_i^l, w). \tag{9}$$

3. For $t = 1, \ldots, N - 1$, compute the temporal differences:

   $$d_t := J(x_{t+1}^l, w) - J(x_t^l, w). \tag{10}$$

4. Update $w$ according to the TDLeaf(λ) formula:

   $$w := w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t^l, w) \left[ \sum_{j=t}^{N-1} \lambda^{j-t} d_j \right]. \tag{11}$$

Figure 3: The TDLeaf(λ) algorithm
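
The four steps of figure 3 can be sketched as follows for a
linear evaluation function, where the gradient at a leaf is
simply its feature vector. This is an illustration under stated
assumptions rather than KnightCap's implementation: the search
is abstracted as a hypothetical callable `principal_leaf(x, w)`
that returns the principal-variation leaf of a minimax search
from `x`, `features` is a hypothetical feature extractor, and
the positions and numbers are made up.

```python
from typing import Callable, List, Sequence


def dot(w: Sequence[float], phi: Sequence[float]) -> float:
    return sum(wi * fi for wi, fi in zip(w, phi))


def tdleaf_update(
    w: Sequence[float],
    positions: Sequence[str],
    outcome: float,
    principal_leaf: Callable[[str, Sequence[float]], str],
    features: Callable[[str], Sequence[float]],
    alpha: float,
    lam: float,
) -> List[float]:
    """One end-of-game TDLeaf(lambda) update for linear J(x, w) = w . phi(x)."""
    # Steps 1-2: principal leaves x_t^l for the non-terminal positions.
    leaf_phis = [features(principal_leaf(x, w)) for x in positions[:-1]]
    # Convention: the final value is the game outcome r(x_N).
    values = [dot(w, phi) for phi in leaf_phis] + [outcome]
    # Step 3: temporal differences between successive principal leaves.
    d = [values[t + 1] - values[t] for t in range(len(leaf_phis))]
    # Step 4: gradient step, accumulating sum_{j >= t} lam^(j-t) d_j backwards.
    new_w = list(w)
    discounted = 0.0
    for t in range(len(d) - 1, -1, -1):
        discounted = d[t] + lam * discounted
        for i, fi in enumerate(leaf_phis[t]):
            new_w[i] += alpha * fi * discounted
    return new_w


# Toy data: a lookup table stands in for the minimax search.
PV_LEAF = {"p1": "q1", "p2": "q2"}
FEATURES = {"q1": [1.0, 2.0], "q2": [1.0, 0.0]}
new_w = tdleaf_update(
    [0.2, 0.1], ["p1", "p2", "p3"], 1.0,
    lambda x, w: PV_LEAF[x], FEATURES.__getitem__,
    alpha=0.1, lam=0.7,
)
print(new_w)
```

The only state the update needs beyond ordinary TD(λ) is the
list of principal leaves, one per position of the game.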



4.1 Experiments with KnightCap 

In our main experiment we took KnightCap's evaluation function
and set all but the material parameters to zero. The material
parameters were initialized to the standard "computer" values:
1 for a pawn, 4 for a knight, 4 for a bishop, 6 for a rook and
12 for a queen. With these parameter settings KnightCap (under
the pseudonym "WimpKnight") was started on the Free Internet
Chess Server (FICS, fics.onenet.net) against both human and
computer opponents. We played KnightCap for 25 games without
modifying its evaluation function so as to get a reasonable
idea of its rating. After 25 games it had a blitz (fast time
control) rating of 1650 ± 50 (footnote 3), which put it at
about B-grade human performance (on a scale from E (1000) to
A (1800)), although of course the kind of game KnightCap plays
with just material parameters set is very different to human
play of the same level (KnightCap makes no short-term tactical
errors but is positionally completely ignorant). We then turned
on the TDLeaf(λ) learning algorithm, with λ = 0.7 and the
learning rate α = 1.0. The value of λ was chosen heuristically,
based on the typical delay in moves before an error takes
effect, while α was set high enough to ensure rapid
modification of the parameters. A couple of minor modifications
to the algorithm were made:



3 The standard deviation for all ratings reported in this
section is about 50.



• The raw (linear) leaf node evaluations $J(x_t^l, w)$ were
converted to a score between -1 and 1 by computing

$$v_t^l = \tanh\left[ \beta J(x_t^l, w) \right].$$

This ensured small fluctuations in the relative values of leaf
nodes did not produce large temporal differences (the values
$v_t^l$ were used in place of $J(x_t^l, w)$ in the TDLeaf(λ)
calculations). The outcome of the game $r(x_N)$ was set to 1
for a win, -1 for a loss and 0 for a draw. β was set to ensure
that a value of $\tanh\left[ \beta J(x_t^l, w) \right] = 0.25$
was equivalent to a material superiority of 1 pawn (initially).

• The temporal differences, $d_t = v_{t+1}^l - v_t^l$, were
modified in the following way. Negative values of $d_t$ were
left unchanged, as any decrease in the evaluation from one
position to the next can be viewed as a mistake. However,
positive values of $d_t$ can occur simply because the opponent
has made a blunder. To avoid KnightCap trying to learn to
predict its opponent's blunders, we set all positive temporal
differences to zero unless KnightCap predicted the opponent's
move (footnote 4).



4 In a later experiment we only set positive temporal
differences to zero if KnightCap did not predict the opponent's
move and the opponent was rated less than KnightCap. After all,
predicting a stronger opponent's blunders is a useful skill,
although whether this made any difference is not clear.



• The value of a pawn was kept fixed at its initial value 
so as to allow easy interpretation of weight values 
as multiples of the pawn value (we actually experi- 
mented with not fixing the pawn value and found it 
made little difference: after 1764 games with an ad- 
justable pawn its value had fallen by less than 7 per- 
cent). 
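
The first two modifications above can be sketched as below. The
helper names (`beta_for_pawn`, `squash`, `filtered_differences`)
and the boolean list `predicted` are assumptions made for
illustration; the source states only that β is chosen so a
one-pawn advantage maps to 0.25, and that positive differences
are zeroed unless the opponent's move was predicted.

```python
import math
from typing import List, Sequence


def beta_for_pawn(pawn_value: float, target: float = 0.25) -> float:
    """Scale beta so that tanh(beta * pawn_value) equals `target`."""
    return math.atanh(target) / pawn_value


def squash(raw_eval: float, beta: float) -> float:
    """Map a raw linear evaluation into the interval (-1, 1)."""
    return math.tanh(beta * raw_eval)


def filtered_differences(
    v: Sequence[float], predicted: Sequence[bool]
) -> List[float]:
    """Temporal differences with unpredicted positive ones set to zero.

    `v` holds the squashed leaf values, with the game outcome as its
    last entry. `predicted[t]` says whether the opponent's move in the
    transition from v[t] to v[t + 1] was the one the program expected.
    """
    out = []
    for t in range(len(v) - 1):
        d = v[t + 1] - v[t]
        if d > 0 and not predicted[t]:
            d = 0.0  # likely an opponent blunder: do not learn from it
        out.append(d)
    return out


beta = beta_for_pawn(pawn_value=1.0)
one_pawn_up = squash(1.0, beta)
diffs = filtered_differences([0.0, 0.3, 0.1, 1.0], [False, True, True])
print(one_pawn_up, diffs)
```

In this toy run the first jump (+0.3) is discarded because the
opponent's move was not predicted, the drop (-0.2) is kept as a
mistake signal, and the final rise to the win is kept because
the move was predicted.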

Within 300 games KnightCap's rating had risen to 2150, an
increase of 500 points in three days, and to a level comparable
with human masters. At this point KnightCap's performance began
to plateau, primarily because it does not have an opening book
and so will repeatedly play into weak lines. We have since
implemented an opening book learning algorithm and with this
KnightCap now plays at a rating of 2400-2500 (peak 2575) on the
other major internet chess server: ICC, chessclub.com
(footnote 5). It often beats International Masters at blitz.
Also, because KnightCap automatically learns its parameters we
have been able to add a large number of new features to its
evaluation function: KnightCap currently operates with 5872
features (1468 features in four stages: opening, middle, ending
and mating (footnote 6)). With this extra evaluation power
KnightCap easily beats versions of Crafty restricted to search
only as deep as itself. However, a big caveat to all this
optimistic assessment is that KnightCap routinely gets crushed
by faster programs searching more deeply. It is quite unlikely
this can be easily fixed simply by modifying the evaluation
function, since for this to work one has to be able to predict
tactics statically, something that seems very difficult to do.
If one could find an effective algorithm for "learning to
search selectively" there would be potential for far greater
improvement.

Note that we have twice repeated the learning experiment and
found a similar rate of improvement and final performance
level. The rating as a function of the number of games from one
of these repeat runs is shown in figure 4 (we did not record
this information in the first experiment). Note that in this
case KnightCap took nearly twice as long to reach the 2150
mark, but this was partly because it was operating with limited
memory (8Mb) until game 500, at which point the memory was
increased to 40Mb (KnightCap's search algorithm, MTD(f) [?], is
a memory intensive variant of α-β, and when learning KnightCap
must store the whole position in the hash table, so small
memory really hurts the performance). Another reason may also
have been that for a portion of the run we were performing
parameter updates after every four games rather than every
game.

5 There appears to be a systematic difference of around 200-250
points between the two servers, so a peak rating of 2575 on ICC
roughly corresponds to a peak of 2350 on FICS. We transferred
KnightCap to ICC because there are more strong players playing
there.

6 In reality there are not 1468 independent "concepts" per
stage in KnightCap's evaluation function, as many of the
features come in groups of 64, one for each square on the board
(like the value of placing a rook on a particular square, for
example).

Figure 4: KnightCap's rating as a function of games played
(second experiment). Learning was turned on at game 0.

Plots of various parameters as a function of the number of
games played are shown in figure 5 (these plots are from the
same experiment as in figure 4). Each plot contains three
graphs corresponding to the three different stages of the
evaluation function: opening, middle and ending (footnote 7).

Finally, we compared the performance of KnightCap with its
learnt weights to KnightCap's performance with a set of
hand-coded weights, again by playing the two versions on ICC.
The hand-coded weights were close in performance to the learnt
weights (perhaps 50-100 rating points worse). We also tested
the result of allowing KnightCap to learn starting from the
hand-coded weights, and in this case it seems that KnightCap
performs better than when starting from just material values
(peak performance was 2632 compared to 2575, but these figures
are very noisy). We are conducting more tests to verify these
results. However, it should not be too surprising that learning
from a good quality set of hand-crafted parameters is better
than just learning from material parameters. In particular,
some of the hand-crafted parameters have very high values (the
value of an "unstoppable pawn", for example) which can take a
very long time to learn under normal playing conditions,
particularly if they are rarely active in the principal leaves.
It is not yet clear whether, given a sufficient number of
games, this dependence on the initial conditions can be made to
vanish.

7 KnightCap actually has a fourth and final stage, "mating",
which kicks in when all the pieces are off, but this stage only
uses a few of the coefficients (opponent's king mobility and
proximity of our king to the opponent's king).

Figure 5: Evolution of two parameters (bonus for castling and
penalty for a doubled pawn) as a function of the number of
games played. Note that each parameter appears three times:
once for each of the three stages in the evaluation function.

4.2 Discussion 

There appear to be a number of reasons for the remarkable 
rate at which KnightCap improved. 

1. As all the non-material weights were initially zero,
even small changes in these weights could cause very
large changes in the relative ordering of materially
equal positions. Hence even after a few games KnightCap
was playing a substantially better game of chess.

2. It seems to be important that KnightCap started out
life with intelligent material parameters. This put it
close in parameter space to many far superior parameter
settings.

3. Most players on FICS prefer to play opponents of sim- 
ilar strength, and so KnightCap's opponents improved 
as it did. This may have had the effect of guiding 
KnightCap along a path in weight space that led to 
a strong set of weights. 

4. KnightCap was learning on-line, not by self-play. The
advantage of on-line play is that there is a great deal
of information provided by the opponent's moves. In
particular, against a stronger opponent KnightCap was
being shown positions that 1) could be forced (against
KnightCap's weak play) and 2) were mis-evaluated by
its evaluation function. Of course, in self-play KnightCap
can also discover positions which are mis-evaluated,
but it will not find the kinds of positions that
are relevant to strong play against other opponents. In
this setting, one can view the information provided by
the opponent's moves as partially solving the "exploration"
part of the exploration/exploitation tradeoff.

To further investigate the importance of some of these 
reasons, we conducted several more experiments. 

Good initial conditions. 

A second experiment was run in which KnightCap's coefficients
were all initialised to the value of a pawn. The value of a
pawn needs to be positive in KnightCap because it is used in
many other places in the code: for example, we deem the MTD
search to have converged if α < β + 0.07*PAWN. Thus, to set all
parameters equal to the same value, that value had to be a
pawn.

Playing with the initial weight settings KnightCap had a blitz
rating of around 1250. After more than 1000 games on FICS
KnightCap's rating has improved to about 1550, a 300 point
gain. This is a much slower improvement than in the original
experiment. We do not know whether the coefficients would have
eventually converged to good values, but it is clear from this
experiment that starting near to a good set of weights is
important for fast convergence. An interesting avenue for
further exploration here is the effect of λ on the learning
rate. Because the initial evaluation function is completely
wrong, there would be some justification in setting λ = 1 early
on so that KnightCap only tries to predict the outcome of the
game and not the evaluations of later moves (which are
extremely unreliable).






Self-Play

Learning by self-play was extremely effective for TD-Gammon,
but a significant reason for this is the randomness of
backgammon, which ensures that with high probability different
games have substantially different sequences of moves, and also
the speed of play of TD-Gammon, which ensured that learning
could take place over several hundred thousand games.
Unfortunately, chess programs are slow, and chess is a
deterministic game, so self-play by a deterministic algorithm
tends to result in a large number of substantially similar
games. This is not a problem if the games seen in self-play are
"representative" of the games played in practice; however,
KnightCap's self-play games with only non-zero material weights
are very different to the kind of games humans of the same
level would play.

To demonstrate that learning by self-play for KnightCap is 
not as effective as learning against real opponents, we ran 
another experiment in which all but the material parame- 
ters were initialised to zero again, but this time KnightCap 
learnt by playing against itself. After 600 games (twice as 
many as in the original FICS experiment), we played the re- 
sulting version against the good version that learnt on FICS 
for a further 100 games with the weight values fixed. The 
self-play version scored only 11% against the good FICS 
version. 

Simultaneously with the work presented here, Beal and Smith
[?] reported positive results using essentially TDLeaf(λ) and
self-play (with some random move choice) when learning the
parameters of an evaluation function that only computed
material balance. However, they were not comparing performance
against on-line players, but were primarily investigating
whether the weights would converge to "sensible" values at
least as good as the naive (1, 3, 3, 5, 9) values for (pawn,
knight, bishop, rook, queen) (they did, within 2000 games, and
using a value of λ = 0.95, which supports the discussion in
"good initial conditions" above).

5 Conclusion 

We have introduced TDLeaf(λ), a variant of TD(λ) suitable for
training an evaluation function used in minimax search. The
only extra requirement of the algorithm is that the leaf nodes
of the principal variations be stored throughout the game.

We presented some experiments in which a chess evaluation
function was trained from B-grade to master level using
TDLeaf(λ) by on-line play against a mixture of human and
computer opponents. The experiments show both the importance of
"on-line" sampling (as opposed to self-play) for a
deterministic game such as chess, and the need to start near a
good solution for fast convergence, although just how near is
still not clear.

On the theoretical side, it has recently been shown that TD(λ)
converges for linear evaluation functions [?] (although only in
the sense of prediction, not control). An interesting avenue
for further investigation would be to determine whether
TDLeaf(λ) has similar convergence properties.